Many-body approach to the dynamics of batch learning

Using the cavity method and diagrammatic methods, we model the dynamics of
batch learning of restricted sets of examples. The approach is widely
applicable to general learning cost functions, and fully takes into account
the temporal correlations introduced by the recycling of the examples.

The extraction of input-output maps from a set of examples, usually termed
learning, is an important and interesting problem in information processing
tasks such as classification and regression [1]. During learning, one
defines an energy function in terms of a training set of examples, which is
then minimized by a gradient descent process with respect to the parameters
defining the input-output map. In batch learning, the same restricted set of
examples is provided for each learning step. There have been attempts using
statistical physics to describe the dynamics of learning with macroscopic
variables. The major difficulty is that the recycling of the examples
introduces temporal correlations of the parameters in the learning history.
Hence previous success has been limited to Adaline learning, linear
perceptrons learning nonlinear rules [?,?], Hebbian learning and binary
weights [?].

Recent advances in on-line learning are based on the circumvention of this
difficulty. In contrast to batch learning, an independent example is
generated for each learning step [10,11]. Since statistical correlations
among the examples can be ignored, the dynamics can be simply described by
instantaneous dynamical variables. However, on-line learning represents an
ideal case in which one has access to an almost infinite training set,
whereas in many applications, the collection of training examples may be
costly.

In this paper, we model batch learning of restricted sets of examples by
considering the learning model as a many-body system. Each example makes a
small contribution to the learning process, which can be described by linear
response terms in a sea of background examples. Our theory is widely
applicable to any gradient-descent learning rule which minimizes an
arbitrary cost function in terms of the activation. It fully takes into
account the temporal correlations during learning, and is exact for large
networks. Preliminary work has been presented recently [12].

Consider the single layer perceptron with $N \gg 1$ input nodes $\{\xi_j\}$
connecting to a single output node by the weights $\{J_j\}$ and often, the
bias $\theta$ as well. For convenience we assume that the inputs $\xi_j$ are
Gaussian variables with mean 0 and variance 1, and the output state $S$ is a
function $f(x)$ of the activation $x$ at the output node, i.e. $S = f(x)$;
$x = \mathbf{J}\cdot\boldsymbol{\xi} + \theta$. For binary outputs,
$f(x) = \mathrm{sgn}\,x$.



The network is assigned to "learn" $p \equiv \alpha N$ examples which map
inputs $\{\xi_j^\mu\}$ to the outputs $\{S_\mu\}$ ($\mu = 1, \ldots, p$). In
the case of random examples, $S_\mu$ are random binary variables, and the
perceptron is used as a storage device. In the case of teacher-generated
examples, $S_\mu$ are the outputs generated by a teacher perceptron with
weights $\{B_j\}$ and often, a bias $\phi$ as well, namely
$S_\mu = f(y_\mu)$; $y_\mu = \mathbf{B}\cdot\boldsymbol{\xi}^\mu + \phi$.

Batch learning is achieved by adjusting the weights $\{J_j\}$ iteratively so
that a certain cost function in terms of the activations $\{x_\mu\}$ and the
outputs $S_\mu$ of all examples is minimized. Hence we consider a general
cost function $E = -\sum_\mu g(x_\mu, y_\mu)$. The precise functional form
of $g(x, y)$ depends on the adopted learning algorithm. In previous studies,
$g(x, y) = -(S - x)^2/2$ with $S = \mathrm{sgn}\,y$ in Adaline learning [?],
and $g(x, y) = xS$ in Hebbian learning.
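
As a minimal sketch of the two cost functions quoted above, the following Python/NumPy fragment writes out $g$ and its derivative with respect to $x$ for the Adaline and Hebbian cases. The function names and the use of NumPy are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def adaline_g(x, y):
    """Adaline: g(x, y) = -(S - x)^2 / 2 with S = sgn(y)."""
    return -0.5 * (np.sign(y) - x) ** 2

def adaline_g_prime(x, y):
    """Partial derivative of the Adaline g with respect to x: S - x."""
    return np.sign(y) - x

def hebb_g(x, y):
    """Hebbian: g(x, y) = x * S with S = sgn(y)."""
    return x * np.sign(y)

def hebb_g_prime(x, y):
    """Partial derivative of the Hebbian g with respect to x: S."""
    return np.sign(y) * np.ones_like(np.asarray(x, dtype=float))

def cost(g, x, y):
    """Cost function E = -sum_mu g(x_mu, y_mu)."""
    return -np.sum(g(x, y))

# Two examples: student activations x_mu and teacher activations y_mu.
x = np.array([0.5, -1.0])
y = np.array([2.0, -3.0])
print(cost(adaline_g, x, y), cost(hebb_g, x, y))
```

For the Adaline case the second derivative with respect to $x$ is the constant $-1$, which is what makes that rule especially convenient to analyze.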

To ensure that the perceptron is regularized after learning, it is customary
to introduce a weight decay term. In the presence of noise, the gradient
descent dynamics of the weights is given by

$$\frac{dJ_j(t)}{dt} = \frac{1}{N}\sum_\mu g'(x_\mu(t), y_\mu)\,\xi_j^\mu - \lambda J_j(t) + \eta_j(t), \tag{1}$$



where the prime represents partial differentiation with respect to $x$,
$\lambda$ is the weight decay strength, and $\eta_j(t)$ is the noise term at
temperature $T$ with $\langle\eta_j(t)\rangle = 0$ and
$\langle\eta_j(t)\eta_k(s)\rangle = 2T\delta_{jk}\delta(t-s)/N$. The
dynamics of the bias $\theta$ is similar, except that no bias decay should
be present according to consistency arguments,



$$\frac{d\theta(t)}{dt} = \frac{1}{N}\sum_\mu g'(x_\mu(t), y_\mu) + \eta_\theta(t). \tag{2}$$
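
The continuous-time dynamics of Eq. (1) can be explored numerically with a simple Euler discretization. The sketch below is only an illustration under stated assumptions: it is restricted to the Adaline cost ($g' = S - x$) with zero bias, starts from zero weights, and uses a step size `dt` chosen by hand. The function name and all parameter values are hypothetical.

```python
import numpy as np

def simulate_weights(xi, S, lam, T=0.0, dt=0.05, steps=2000, seed=0):
    """Euler discretization of Eq. (1) for the Adaline cost, with theta = 0.

    xi : (p, N) array of inputs; S : (p,) array of target outputs.
    """
    rng = np.random.default_rng(seed)
    p, N = xi.shape
    J = np.zeros(N)                       # assumed initial condition J(0) = 0
    for _ in range(steps):
        x = xi @ J                        # activations x_mu(t)
        grad = (S - x) @ xi / N           # (1/N) sum_mu g'(x_mu) xi_j^mu
        # Noise with <eta_j(t) eta_k(s)> = 2 T delta_jk delta(t - s) / N,
        # integrated over one step of length dt.
        noise = rng.normal(size=N) * np.sqrt(2.0 * T * dt / N)
        J = J + dt * (grad - lam * J) + noise
    return J

rng = np.random.default_rng(1)
N, alpha, lam = 100, 0.8, 0.4
p = int(alpha * N)
xi = rng.normal(size=(p, N))              # Gaussian inputs, mean 0, variance 1
S = rng.choice([-1.0, 1.0], size=p)       # random examples
J = simulate_weights(xi, S, lam)          # T = 0 run
```

At $T = 0$ the iteration relaxes to the stationary point of Eq. (1), which for the Adaline cost is the solution of a linear system; this gives a convenient consistency check on the discretization.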



Our theory is the dynamical version of the cavity method [13-15]. It uses a
self-consistency argument to consider what happens when a new example is
added to a training set. The central quantity in this method is the cavity
activation, which is the activation of a new example for a perceptron
trained without that example. Since the original network has no information
about the new example, the cavity activation is random. Here we present the
theory for $\theta = \phi = 0$, skipping extensions to biased perceptrons.
Denoting the new example by the label 0, its cavity activation at time $t$
is $h_0(t) = \mathbf{J}(t)\cdot\boldsymbol{\xi}^0$. For large $N$, $h_0(t)$
is a Gaussian variable. Its covariance is given by the correlation function
$C(t, s)$ of the weights at times $t$ and $s$, that is,
$\langle h_0(t)h_0(s)\rangle = \mathbf{J}(t)\cdot\mathbf{J}(s) \equiv C(t, s)$,
where $\xi_j^0$ and $\xi_k^0$ are assumed to be independent for $j \neq k$.
For teacher-generated examples, the distribution is further specified by the
teacher-student correlation $R(t)$, given by
$\langle h_0(t)y_0\rangle = \mathbf{J}(t)\cdot\mathbf{B} \equiv R(t)$.
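
The Gaussian statistics of the cavity activation can be checked with a small Monte Carlo experiment. In the sketch below, two fixed weight vectors stand in for $\mathbf{J}(t)$ and $\mathbf{J}(s)$; they are drawn at random here purely for illustration, whereas in the theory they come from the learning dynamics. The sizes `N` and `M` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 20000

# Stand-ins for the weights at two times and for the teacher (hypothetical).
J_t = rng.normal(size=N) / np.sqrt(N)
J_s = 0.6 * J_t + 0.8 * rng.normal(size=N) / np.sqrt(N)
B = rng.normal(size=N) / np.sqrt(N)

# M independent "new" examples, none of which influenced the weights.
xi0 = rng.normal(size=(M, N))
h_t, h_s, y0 = xi0 @ J_t, xi0 @ J_s, xi0 @ B

C_emp, C_th = np.mean(h_t * h_s), J_t @ J_s   # <h0(t) h0(s)> vs J(t).J(s)
R_emp, R_th = np.mean(h_t * y0), J_t @ B      # <h0(t) y0>   vs J(t).B
print(C_emp, C_th, R_emp, R_th)
```

The empirical averages agree with the dot products up to sampling error of order $1/\sqrt{M}$, because each new example is independent of the weights.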

Now suppose the perceptron incorporates the new example at the batch-mode
learning step at time $s$. Then the activation of this new example at a
subsequent time $t > s$ will no longer be a random variable. Furthermore,
the activations of the original $p$ examples at time $t$ will also be
adjusted from $\{x_\mu(t)\}$ to $\{x_\mu^0(t)\}$ because of the newcomer,
which will in turn affect the evolution of the activation of example 0,
giving rise to the so-called Onsager reaction effects. This makes the
dynamics complex, but fortunately for large $p \sim N$, we can assume that
the adjustment from $x_\mu(t)$ to $x_\mu^0(t)$ is small, and linear response
theory can be applied.

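Suppose the weights of the original and new perceptron at time $t$ are
$\{J_j(t)\}$ and $\{J_j^0(t)\}$ respectively. Then a perturbation of (1)
yields

$$\left(\frac{d}{dt} + \lambda\right)\left(J_j^0(t) - J_j(t)\right) = \frac{1}{N}\,g'(x_0(t), y_0)\,\xi_j^0 + \frac{1}{N}\sum_{\mu k}\xi_j^\mu\,g''(x_\mu(t), y_\mu)\,\xi_k^\mu\left(J_k^0(t) - J_k(t)\right). \tag{3}$$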



The first term on the right hand side describes the primary effects of
adding example 0 to the training set, and is the driving term for the
difference between the two perceptrons. The second term describes the
many-body reactions due to the changes of the original examples caused by
the added example. The equation can be solved by the Green's function
technique, yielding

$$J_j^0(t) - J_j(t) = \sum_k \int ds\, G_{jk}(t, s)\,\frac{1}{N}\,g_0'(s)\,\xi_k^0, \tag{4}$$

where $g_0'(s) \equiv g'(x_0(s), y_0)$ and $G_{jk}(t, s)$ is the weight
Green's function, which describes how the effects of a perturbation
propagate from weight $J_k$ at learning time $s$ to weight $J_j$ at a
subsequent time $t$. In the present context, the perturbation comes from the
gradient term of example 0, such that integrating over the history and
summing over all nodes give the resultant change from $J_j(t)$ to
$J_j^0(t)$.

For large $N$ the weight Green's function can be found by the diagrammatic
approach. The result is self-averaging over the distribution of examples and
is diagonal, i.e. $\lim_{N\to\infty} G_{jk}(t, s) = G(t, s)\delta_{jk}$,
where

$$G(t, s) = G^{(0)}(t - s) + \alpha\int dt_1 \int dt_2\, G^{(0)}(t - t_1)\,\langle g_\mu''(t_1)D_\mu(t_1, t_2)\rangle\, G(t_2, s). \tag{5}$$

$G^{(0)}(t - s) \equiv \Theta(t - s)\exp(-\lambda(t - s))$ is the bare
Green's function, and $\Theta$ is the step function. $D_\mu(t, s)$ is the
example Green's function given by

$$D_\mu(t, s) = \delta(t - s) + \int dt'\, G(t, t')\,g_\mu''(t')\,D_\mu(t', s). \tag{6}$$
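
For the Adaline rule treated later in the text, $g'' = -1$ and the Green's functions depend only on $t - s$. Under that assumption, a Laplace transform reduces Eq. (6) to $\tilde D = 1/(1 + \tilde G)$ and Eq. (5) to the quadratic $u\tilde G^2 + (u + \alpha - 1)\tilde G - 1 = 0$ with $u = z + \lambda$. This reduction is a worked step added here for illustration, not a formula quoted from the text. The sketch compares its positive root with a finite-$N$ evaluation that assumes $\tilde G(z)$ equals the normalized trace of $(z + \lambda + A)^{-1}$, where $A$ is the input correlation matrix of the linear Adaline dynamics. Function names and sizes are hypothetical.

```python
import numpy as np

def green_laplace(z, alpha, lam):
    """Positive root of u G^2 + (u + alpha - 1) G - 1 = 0, with u = z + lam."""
    u = z + lam
    b = u + alpha - 1.0
    return (-b + np.sqrt(b * b + 4.0 * u)) / (2.0 * u)

def green_laplace_numeric(z, alpha, lam, N=400, seed=0):
    """(1/N) Tr (z + lam + A)^{-1} for A = (1/N) sum_mu xi^mu (xi^mu)^T."""
    rng = np.random.default_rng(seed)
    p = int(round(alpha * N))
    xi = rng.normal(size=(p, N))
    eig = np.linalg.eigvalsh(xi.T @ xi / N)   # spectrum of the symmetric A
    return np.mean(1.0 / (z + lam + eig))

z, alpha, lam = 0.5, 0.8, 0.4
print(green_laplace(z, alpha, lam), green_laplace_numeric(z, alpha, lam))
```

The two numbers should agree up to finite-size corrections that shrink as `N` grows, which is a useful sanity check on the self-consistent structure of Eqs. (5) and (6).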



Our approach to the macroscopic description of the learning dynamics is to
relate the activation of the examples to their cavity counterparts.
Multiplying both sides of (4) by $\xi_j^0$ and summing over $j$, we get

$$x_0(t) - h_0(t) = \int ds\, G(t, s)\,g_0'(s). \tag{7}$$



The activation distribution is thus related to the cavity activation
distribution, which is known to be Gaussian. In turn, the covariance of this
Gaussian distribution is provided by the fluctuation-response relation

$$C(t, s) = \alpha\int dt'\, G^{(0)}(t - t')\,\langle g_\mu'(t')\,x_\mu(s)\rangle + 2T\int dt'\, G^{(0)}(t - t')\,G(s, t'). \tag{8}$$

Furthermore, for teacher-generated examples, its mean is related to the
teacher-student correlation given by

$$R(t) = \alpha\int dt'\, G^{(0)}(t - t')\,\langle g_\mu'(t')\,y_\mu\rangle. \tag{9}$$



To monitor the progress of learning, we are interested in three performance
measures: (a) Training error $\epsilon_t$, which is the probability of error
for the training examples. (b) Test error $\epsilon_{\rm test}$, which is
the probability of error when the inputs of the training examples are
corrupted by an additive Gaussian noise of variance $\Delta^2$. This is a
relevant performance measure when the perceptron is applied to process data
which are the corrupted versions of the training data. When $\Delta^2 = 0$,
the test error reduces to the training error. (c) Generalization error
$\epsilon_g$ for teacher-generated examples, which is the probability of
error for an arbitrary input $\boldsymbol{\xi}$ when the teacher and student
outputs are compared.

The cavity method can be applied to the dynamics of learning with an
arbitrary cost function. When it is applied to the Hebb rule, it yields
results identical to [?]. Here for illustration, we present the results for
the Adaline rule. This is a common learning rule and resembles the more
common back-propagation rule. Theoretically, its dynamics is particularly
convenient for analysis since $g''(x) = -1$, rendering the weight Green's
function time translation invariant, i.e. $G(t, s) = G(t - s)$. In this
case, the dynamics can be solved by Laplace transform, and the cavity
approach facilitates a deeper understanding than previous studies.
Illustrative results are summarized with respect to the following aspects:

1) Overtraining of $\epsilon_g$: As shown in Fig. 1, $\epsilon_g$ decreases
at the initial stage of learning. However, for sufficiently weak weight
decay, it attains a minimum at a finite learning time before reaching a
higher steady-state value. This is called overtraining since at the later
stage of learning, the perceptron is focusing too much on the specific
details of the training set. In this case $\epsilon_g$ can be optimized by
early stopping, i.e. terminating the learning process before it reaches the
steady state. Similar behavior is observed in linear perceptrons [?].

This phenomenon can be controlled by tuning the weight decay $\lambda$. The
physical picture is that the perceptron with minimum $\epsilon_g$
corresponds to a point with a magnitude $|\mathbf{J}^*|$. When $\lambda$ is
too strong, $|\mathbf{J}|$ never reaches this magnitude and $\epsilon_g$
saturates at a suboptimal value. On the other hand, when $\lambda$ is too
weak, $|\mathbf{J}|$ grows with learning time and is able to pass near the
optimal point during its learning history. Hence the weight decay
$\lambda_{\rm ot}$ for the onset of overtraining is closely related to the
optimal weight decay $\lambda_{\rm opt}$ at which the steady-state
$\epsilon_g$ is minimum. Indeed, at $T = 0$ and for all values of $\alpha$,
$\lambda_{\rm ot} = \lambda_{\rm opt} = \pi/2 - 1$; the coincidence of
$\lambda_{\rm ot}$ and $\lambda_{\rm opt}$ was also observed previously [?].
Early stopping for $\lambda < \lambda_{\rm ot} = \lambda_{\rm opt}$ can
speed up the learning process, but cannot outperform the optimal result at
the steady state. A recent empirical observation confirms that a careful
control of the weight decay may be better than early stopping in optimizing
generalization [?].

At nonzero temperatures, we find the new result that $\lambda_{\rm ot}$ and
$\lambda_{\rm opt}$ may become different. While various scenarios are
possible, here we only mention the case of sufficiently large $\alpha$. As
shown in the inset of Fig. 1, $\lambda_{\rm opt}$ lies inside the region of
overtraining, implying that even the best steady-state $\epsilon_g$ is
outperformed by some point during its own learning history. This means the
optimal $\epsilon_g$ can only be attained by tuning both the weight decay
and learning time. However, at least in the present case, computational
results show that the improvement is marginal.

FIG. 1. The evolution of the generalization error at $\alpha = 0.8$ and
$T = 0$ for different weight decay strengths $\lambda$. Theory: solid line,
simulation: symbols. Inset: The temperature dependence of the optimal weight
decay $\lambda_{\rm opt}$ (dashed) and the onset of overtraining
$\lambda_{\rm ot}$ (solid) at $\alpha = 5$.
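
The early-stopping picture of item 1) can be illustrated for a single finite-size sample. For the Adaline cost at $T = 0$, Eq. (1) is linear in the weights, so the trajectory follows from an eigendecomposition. The sketch below assumes zero bias, zero initial weights, teacher-generated examples, and the standard arccos expression for $\epsilon_g$ of unbiased perceptrons; it is a finite-$N$ illustration rather than the macroscopic theory, and all names and parameter values are hypothetical.

```python
import numpy as np

def adaline_trajectory(xi, S, lam, times):
    """J(t) from Eq. (1) with the Adaline cost, T = 0, J(0) = 0, theta = 0."""
    p, N = xi.shape
    A = xi.T @ xi / N                     # input correlation matrix
    b = S @ xi / N                        # Hebbian driving term
    k, V = np.linalg.eigh(A + lam * np.eye(N))   # relaxation rates and modes
    b_modes = V.T @ b
    # Each mode relaxes as (1 - exp(-k t)) / k towards its fixed point.
    amp = (1.0 - np.exp(-np.outer(times, k))) / k
    return (amp * b_modes) @ V.T          # row i holds J(times[i])

def eps_g(J, B):
    """Generalization error from the student-teacher angle (rows of J)."""
    c = J @ B / (np.linalg.norm(J, axis=-1) * np.linalg.norm(B))
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(0)
N, alpha, lam = 300, 0.8, 0.1
p = int(alpha * N)
B = rng.normal(size=N)
xi = rng.normal(size=(p, N))
S = np.sign(xi @ B)

times = np.linspace(0.05, 40.0, 800)
eg = eps_g(adaline_trajectory(xi, S, lam, times), B)
t_stop = times[np.argmin(eg)]             # early-stopping time on this sample
print(t_stop, eg.min(), eg[-1])
```

Comparing `eg.min()` with `eg[-1]` for several values of `lam` gives a quick sample-level view of when early stopping helps and when the steady state is already the best point.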

2) Overtraining of $\epsilon_{\rm test}$: This is best understood by
considering the effects of tuning the input noise from zero, when
$\epsilon_{\rm test}$ starts to increase from $\epsilon_t$. At the steady
state $\epsilon_t$ is optimized by $\lambda = 0$ for $\alpha < 1$, and by a
relatively small $\lambda > 0$ for $\alpha > 1$. This means that
$\epsilon_{\rm test}$ is optimized with no or only little concern about the
magnitude of $\mathbf{J}^2$. However, when input noise is introduced, it
adds a Gaussian noise of variance $\Delta^2\mathbf{J}^2$ to the activation
distribution. The optimization of $\epsilon_{\rm test}$ now involves
minimizing the error of the training set without using an excessively large
$\mathbf{J}^2$. Thus the role of weight decay becomes important. Indeed, at
$T = 0$, $\lambda_{\rm opt} = \alpha\Delta^2$ for random examples, whereas
$\lambda_{\rm opt} \propto \Delta^2$ approximately for teacher-generated
examples. This illustrates how the environment in anticipated applications,
i.e. the level of input noise, affects the optimal choice of perceptron
parameters.

Analogous to the dynamics of $\epsilon_g$, overtraining can occur when a
sufficiently weak $\lambda$ allows $\mathbf{J}$ to pass near the optimal
point during its learning history. Indeed, at $T = 0$ the onset of
overtraining is given by $\lambda_{\rm ot} = \lambda_{\rm opt}$ for random
examples, whereas $\lambda_{\rm ot} \approx \lambda_{\rm opt}$ for
teacher-generated examples. At nonzero temperatures, $\lambda_{\rm ot}$ and
$\lambda_{\rm opt}$ become increasingly distinct, and for sufficiently large
$\alpha$, $\lambda_{\rm opt} < \lambda_{\rm ot}$ as shown in the inset of
Fig. 2, so that the optimal $\epsilon_{\rm test}$ can only be attained by
tuning both the weight decay and learning time.

3) Average dynamics: When learning has reached the steady state, the
dynamical variables fluctuate about their temporal averages because of
thermal noise. If we consider a perceptron constructed using the thermally
averaged weights $\langle J_j\rangle_{\rm th}$, we can then prove that it is
equivalent to the perceptron obtained at $T = 0$. This equivalence implies
that for perceptrons with thermal noise, the training and generalization
errors can be reduced by temporal averaging down to those at $T = 0$.

We can further compute the performance improvement as a function of the
duration $\tau$ of the monitoring period for thermal averaging, as confirmed
by simulations in Fig. 2. Note that the Green's function is a superposition
of relaxation modes $\exp(-kt)$ whose rate $k$ lies in the range
$k_{\min} \le k \le k_{\max}$, where $k_{\max}$ and $k_{\min}$ are
$\lambda + (\sqrt{\alpha} \pm 1)^2$ respectively. For $\alpha < 1$, there is
an additional relaxation mode with rate $\lambda$, which describes the
relaxation by weight decay inside the $N - p$ dimensional solution space of
zero training error. Hence the monitoring period scales as $k_{\min}^{-1}$
for $\alpha > 1$, and $\lambda^{-1}$ for $\alpha < 1$. Note that this time
scale diverges for vanishing weight decay at $\alpha < 1$. The time scale
for thermal averaging agrees with the relaxation time proposed for
asymptotic dynamics in [2].

We remark that the relaxation time for steady-state dynamics may not be the
same as the convergence time for learning in the transient regime. For
example, for $\alpha < 1$ and vanishing weight decay at $T = 0$, significant
reduction of $\epsilon_t$ takes place in a time scale independent of
$\lambda$, since the dynamics is dominated by a growth of the projection
onto the solution space of zero training error. On the other hand, the
asymptotic relaxation time diverges as $\lambda^{-1}$.
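
The band of relaxation rates quoted in item 3) can be compared with the spectrum of a finite sample. The sketch below assumes the Adaline rule, for which the rates are $\lambda$ plus the eigenvalues of the input correlation matrix; that identification, the function names, and the sizes are assumptions made for illustration.

```python
import numpy as np

def relaxation_rates(alpha, lam, N=200, seed=0):
    """Rates lam + eig(A) for A = (1/N) sum_mu xi^mu (xi^mu)^T, p = alpha N."""
    rng = np.random.default_rng(seed)
    p = int(round(alpha * N))
    xi = rng.normal(size=(p, N))
    return lam + np.linalg.eigvalsh(xi.T @ xi / N)

def rate_band(alpha, lam):
    """Band edges k_min, k_max = lam + (sqrt(alpha) -/+ 1)^2 from the text."""
    return (lam + (np.sqrt(alpha) - 1.0) ** 2,
            lam + (np.sqrt(alpha) + 1.0) ** 2)

alpha, lam, N = 0.5, 0.1, 200
rates = relaxation_rates(alpha, lam, N)
flat = np.abs(rates - lam) < 1e-8         # modes that relax at rate lam only
print(int(flat.sum()), rates[~flat].min(), rates[~flat].max(),
      rate_band(alpha, lam))
```

For $\alpha < 1$ the count of modes at rate $\lambda$ equals $N - p$, the dimension of the zero-training-error solution space, while the remaining rates fall close to the predicted band up to finite-size effects.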




FIG. 2. The training error at $\alpha = 0.1$ and $\lambda = 5$ of the
thermally averaged perceptron for random examples versus the duration $\tau$
of the monitoring period for thermal averaging. Inset: The lines of the
optimal weight decay $\lambda_{\rm opt}$ (dashed) and the onset of
overtraining $\lambda_{\rm ot}$ (solid) of the test error for
teacher-generated examples at $\alpha = 3$ and $T = 0.3$.

4) Dynamics of the bias: For biased perceptrons, $\theta(t)$ approaches the
steady-state value $\mathrm{erf}(\phi/\sqrt{2})$. (The failure of the
student to learn the teacher bias is due to the inadequacy of the Adaline
rule, and will be absent in other learning rules such as back-propagation.)

The absence of bias decay modifies the dynamics of learning. For
$\lambda < \sqrt{\alpha} - 1$, $\theta(t)$ consists of relaxation modes with
rates $k$ in the range $k_{\min} < k < k_{\max}$, as in the evolution of the
weights. Hence the weights and the bias learn at the same rate, and
convergence is limited by the rate $k_{\min}$. However, for
$\lambda > \sqrt{\alpha} - 1$, $\theta(t)$ has an additional relaxation mode
with rate $\Lambda = \alpha\lambda/(1 + \lambda)$. Since
$\Lambda < k_{\min}$, the bias learns slower than the weights, and
convergence is limited by the rate $\Lambda$, as illustrated in Fig. 3 which
compares the evolution of the weight overlap $R(t)$ and $\theta(t)$. If
faster convergence is desired, the learning rate of the bias has to be
increased.
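
As a small worked example of the rates just discussed, the sketch below evaluates the lower band edge $k_{\min} = \lambda + (\sqrt{\alpha} - 1)^2$ and the additional bias mode rate $\alpha\lambda/(1 + \lambda)$, and selects the one that limits convergence according to the condition stated above. The function names are hypothetical. The algebraic identity in the comment is an added observation, showing that the two rates coincide exactly at $\lambda = \sqrt{\alpha} - 1$.

```python
import math

def k_min(alpha, lam):
    """Lower edge of the band of weight relaxation rates."""
    return lam + (math.sqrt(alpha) - 1.0) ** 2

def bias_mode_rate(alpha, lam):
    """Rate of the additional bias relaxation mode, alpha*lam/(1 + lam)."""
    return alpha * lam / (1.0 + lam)

def limiting_rate(alpha, lam):
    """Rate that limits convergence, following the condition in the text."""
    if lam > math.sqrt(alpha) - 1.0:
        return bias_mode_rate(alpha, lam)   # bias mode is present and slowest
    return k_min(alpha, lam)                # weights and bias share k_min

# Identity: k_min - bias_mode_rate
#         = (sqrt(1 + lam) - sqrt(alpha / (1 + lam)))**2 >= 0,
# with equality at lam = sqrt(alpha) - 1.
alpha, lam = 0.8, 0.4                       # parameters quoted for Fig. 3
print(k_min(alpha, lam), bias_mode_rate(alpha, lam), limiting_rate(alpha, lam))
```

For the Fig. 3 parameters the bias mode is the slower one, consistent with the bias lagging behind the weight overlap.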
FIG. 3. The evolution of the teacher-student weight overlap $R(t)$ and the
bias $\theta(t)$ at $\alpha = 0.8$, $\lambda = 0.4$ and $T = 0$.

In summary, we have introduced a general framework for modeling the dynamics
of learning based on the cavity method, which is much more versatile than
existing theories. It allows us to reach useful conclusions about
overtraining and early stopping, input noise and temperature effects,
transient and average dynamics, and the convergence of bias and weights. We
consider the present work as only the beginning of a new area of study. Many
interesting and challenging issues remain to be explored. For example, it is
interesting to generalize the method to dynamics with discrete learning
steps of finite learning rates. Furthermore, the theory can be extended to
multilayer networks.

This work was supported by the Research Grant Council of Hong Kong
(HKUST6130/97P).